Music Similarity Detection Guided by Deep Learning Model



Introduction
The public acquires digital music through many different media. As demand has grown, Music Information Retrieval (MIR) has been developed to give users more convenient and accurate access to the music they prefer. The core task of MIR is music classification, for example by musical style, by the emotion a piece conveys, or by singer, and this classification rests on detecting the similarity between pieces of music. A good MIR system stimulates people's interest in searching for their favorite music and allows developers to manage different music more effectively. However, the structural characteristics of the same musical style can vary significantly with changes in singing venues, musical instruments, and singers. Even when the same singer sings the same song again, the structural characteristics change because the vocal range is used differently. Current MIR systems are therefore imperfect, and a music similarity detection algorithm that simulates human auditory cognition and analyzes music signals in depth is needed to improve the accuracy of music recommendations.
Deep learning (DL) algorithms make it convenient to extract features for detection tasks. DL analyzes and processes complex multidimensional data through a hierarchical structure in which each layer is composed of small feature-detecting units. The lower layers first detect simple features and pass them to the higher layers, which detect more complex features. In other words, the core idea of DL is to obtain more complex and deeper feature representations by stacking multiple nonlinear processing units, so that a hierarchical feature representation of the original music information is obtained as data pass from layer to layer. DL processes information by imitating the structure of the human brain: the algorithm stores a large amount of data in advance, analyzes the correlations within it, and mines the core features of the data to improve detection and classification performance. DL is essentially a type of large, complex, and deep neural network. Existing research results are primarily isolated studies rather than systematic implementations and applications. Sheikh Fathollahi and Razzazi designed a music similarity and recommendation system based on the cosine similarity and Euclidean distance between feature vectors [1]. Purwins et al. identified the key issues and open questions in applying DL to audio signal processing [2]. Zinemanas et al. proposed a novel interpretable DL model for automatic sound classification that explains its predictions through the similarity of the input to a set of learned prototypes in the latent space. The proposed model achieved results comparable to state-of-the-art methods on three different sound classification tasks involving speech, music, and ambient audio [3].
This paper applies a convolutional neural network (CNN), one of the DL models, to music similarity detection (MSD). The spectrogram of the original music signal is first separated with the Harmonic and Percussive Source Separation (HPSS) algorithm, and the resulting spectrograms are then input together into the multilayer convolutional network. Finally, the training-related hyperparameters are adjusted and the dataset is expanded to study their effect on the detection rate. Experimental results demonstrate that this scheme can effectively improve on MSD that relies on a single feature.

Preprocessing of Music Signals.
Many factors can make the extracted music signal features inaccurate or insufficiently detailed. As a result, the detection accuracy of music similarity has long been unsatisfactory. Figure 1 shows the structure shared by the detection methods.
According to Figure 1, the two core parts of the system are extracting music features and classifying them, and the accuracy of feature extraction determines the final result of similarity detection. It is necessary to extract as many feature quantities as possible from the music data for modeling and to detect and classify music according to the specific detection and classification task [4]. Therefore, preprocessing the music samples is the pivotal first step in detection and classification. This paper adopts the Mel-Frequency Cepstral Coefficients (MFCC), which are based on the cepstrum (the result of applying a Fourier transform to the logarithm of the spectrum) and are in line with human hearing [5]. MFCC transforms the music signal into a spectrogram through the frequency-domain features of the signal. Since sound is an analog signal, it is essential to convert the sound waveform into an acoustic feature vector [6]. Figure 2 is a flowchart of feature extraction via MFCC.
According to Figure 2, the music signal is pre-emphasized, framed, windowed, and Fourier transformed. The resulting power spectrum is then passed through a bank of triangular band-pass filters whose placement follows the relationship between the Mel scale and linear frequency, and the filter outputs are converted into logarithmic form. Finally, the Discrete Cosine Transform is performed to obtain the MFCC values [7]. A series of preliminary procedures, such as analog-to-digital conversion and pre-emphasis, must be carried out before the MFCC computation starts. The analog-to-digital conversion, whose purpose is to convert the analog signal into a digital signal, mainly includes two tasks: sampling and quantization. First, the sound wave is converted into a digital signal that is convenient for processing, using a certain number of samples and sampling rate. Then, feature extraction is performed on the digital signal through MFCC [8].
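The first stages of the MFCC pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the pre-emphasis coefficient of 0.97, the Hamming window, and the Mel formula 2595 log10(1 + f/700) are common conventions assumed here, and all function names are hypothetical.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1] (coeff is an assumed value)."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))]

def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames; an incomplete tail is dropped."""
    if len(signal) < frame_len:
        return []
    n_frames = 1 + (len(signal) - frame_len) // hop
    return [signal[i * hop: i * hop + frame_len] for i in range(n_frames)]

def hamming(frame_len):
    """Hamming window (assumed window type; the text only says 'windowed')."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]

def apply_window(frame, window):
    return [x * w for x, w in zip(frame, window)]

def hz_to_mel(hz):
    """Common Mel-scale mapping (assumed formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_min, f_max):
    """Centre frequencies in Hz of triangular filters equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]
```

The remaining steps (Fourier transform, filter-bank energies, logarithm, and Discrete Cosine Transform) would operate on the windowed frames produced here.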

CNN.
CNN is primarily used to process multidimensional array data. The input of each layer is three-dimensional data, i.e., a feature map, and the output of each layer is also a three-dimensional feature map. The number of convolution kernels in each layer determines the number of feature maps it produces [9]. The early stages of the network consist of convolution layers and pooling layers. Each neuron in a feature map is connected to a local patch of the previous layer through a set of filters, and the locally weighted sum is then passed through a nonlinear function. Since all neurons in a feature map use the same filter, they share weights and detect the same feature in different parts of the image [10]. Figure 3 illustrates the convolution process: it shows the result produced by a 3 × 3 convolution kernel on a 5 × 5 image. It can be seen that the function of the convolution layer is to connect locally to the feature maps of the previous layer. The role of the pooling layer is to merge similar features into one. Because the positions of features can shift, pooling coarse-grains the position of each feature [11], which makes the representation robust to small positional changes in the previous layer. There are generally two pooling methods: Average Pooling and Max Pooling [12]. Figure 4 shows the Max Pooling process.

Computational Intelligence and Neuroscience
As can be seen from Figure 4, each 2 × 2 window selects its maximum value as the corresponding element of the output matrix. The deep neural network exploits the hierarchical structure of natural signals and combines low-level features into high-level features [13]. For example, in image processing, local edges are combined into basic patterns, which are assembled into parts and finally into the overall image of the object [14].
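The convolution and Max Pooling operations described for Figures 3 and 4 can be sketched as follows. This is a minimal plain-Python illustration with hypothetical function names; as in most CNN frameworks, the "convolution" is computed without flipping the kernel, and stride 1 without padding is assumed for the convolution.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

def relu(fmap):
    """Nonlinear activation: negative values are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2, stride=2):
    """Keep the maximum of every size x size window."""
    out = []
    for r in range(0, len(fmap) - size + 1, stride):
        row = []
        for c in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[r + i][c + j] for i in range(size) for j in range(size)))
        out.append(row)
    return out
```

A 3 × 3 kernel applied to a 5 × 5 image in this way yields a 3 × 3 feature map, and 2 × 2 pooling with stride 2 halves each spatial dimension.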

The Key to MSD Is the Feature Extraction of Music Information.
CNN processing consists of three parts: multilevel processing of input images, extraction of multilayer data, and representation of high-level features. This paper applies CNN to MSD and analyzes the influence of the network structure parameters on the detection rate by changing them [15]. Figure 5 displays the overall framework of MSD.
In Figure 5, the original music is first separated into harmonic and percussive (impact) sound sources using the HPSS algorithm. Then, the separated sources and the original music are transformed into spectrograms through the short-time Fourier transform and input into the CNN for learning, training, and prediction. The final result is the detection rate [16].

This Paper Mainly Uses the Harmonic/Percussive Separation Algorithm for MSD to Separate the Harmonic and Impact Sound Components in the Music Signal.
This algorithm relies on the anisotropic continuity of the spectrogram to separate the signal: the percussive (impact) spectrum is distributed continuously and smoothly along the frequency direction, whereas the harmonic spectrum is distributed continuously and smoothly along the time direction [17]. Equation (1) is derived from these differences in the spectral representation of impact and harmonic sounds. In equation (1), $t$ represents time; $f$ stands for the frequency index; $W_{f,t}$ signifies the original spectrum; $P_{f,t}$ denotes the percussive spectrum, which must be greater than 0; and $H_{f,t}$ indicates the harmonic spectrum, which must also be greater than 0. Assume that the differences $P_{f-1,t} - P_{f,t}$ and $H_{f,t-1} - H_{f,t}$ follow independent Gaussian distributions and that the original spectrum is composed of impact and harmonic sound. Then, the two components can be separated by minimizing equation (2) [18]. In equation (2), $i$ refers to the current iteration number, and $\sigma_H$ and $\sigma_P$ are the parameters that control the smoothness of the harmonic and percussive sounds, respectively [19].
The original spectrum $W_{f,t}$ is obtained from $F_{f,t}$, the original signal after the Fourier transform, together with $c$, a real number between 0 and 1 that corrects for the mismatch caused by the Gaussian assumption [20]. The variables are updated according to equations (3) and (4) so that equation (2) takes its minimum value.
In equations (3) and (4), $\Delta$ is an auxiliary update term that depends on the weight factor $\alpha = \sigma_P^2 / (\sigma_H^2 + \sigma_P^2)$. Equations (3) and (4) guarantee that the objective decreases monotonically and converges. After several iterations, the result approaches the minimum, which achieves the separation of the music signal [21].
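The displayed equations (1) to (4) are not reproduced in this text. For orientation, the block below gives the widely used complementary-diffusion formulation of HPSS, written with the symbols defined above. It is an assumption about what equations (1) to (4) contain, not a transcription of them: the exponent $\gamma$ plays the role of the correction parameter called $c$ in the text, and the factor $1/4$ and the clipping to $[0, W_{f,t}]$ come from the standard algorithm.

```latex
\begin{align}
W_{f,t} &= H_{f,t} + P_{f,t}, \qquad H_{f,t} > 0,\; P_{f,t} > 0,
  \qquad W_{f,t} = \lvert F_{f,t} \rvert^{2\gamma} \tag{1}\\
J(H, P) &= \frac{1}{2\sigma_H^2} \sum_{f,t} \bigl(H_{f,t-1} - H_{f,t}\bigr)^2
         + \frac{1}{2\sigma_P^2} \sum_{f,t} \bigl(P_{f-1,t} - P_{f,t}\bigr)^2 \tag{2}\\
H^{(i+1)}_{f,t} &= \min\Bigl(\max\bigl(H^{(i)}_{f,t} + \Delta^{(i)}_{f,t},\, 0\bigr),\, W_{f,t}\Bigr) \tag{3}\\
P^{(i+1)}_{f,t} &= W_{f,t} - H^{(i+1)}_{f,t} \tag{4}\\
\Delta^{(i)}_{f,t} &= \alpha\,\frac{H^{(i)}_{f,t-1} - 2H^{(i)}_{f,t} + H^{(i)}_{f,t+1}}{4}
  - (1-\alpha)\,\frac{P^{(i)}_{f-1,t} - 2P^{(i)}_{f,t} + P^{(i)}_{f+1,t}}{4},
  \qquad \alpha = \frac{\sigma_P^2}{\sigma_H^2 + \sigma_P^2} \notag
\end{align}
```

In this form, the first term of $J$ penalizes changes of the harmonic component along time and the second penalizes changes of the percussive component along frequency, which matches the anisotropy argument given above.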

Network Structure.
The first few layers of the CNN serve as a feature extractor that automatically learns image features through supervised training; the features are then classified by the softmax function in the final layer [22]. Figure 6 presents the CNN structure.
As can be seen from Figure 6, the CNN has eight layers in total. The first five are convolutional layers, interleaved with Max Pooling layers, and the remaining three are fully connected layers. The input images of the CNN are the harmonic spectrum and impact spectrum generated by HPSS separation, together with the original signal spectrum. The images are unified to 256 × 256 and input into the first convolutional layer. The first convolutional layer filters the input image with 96 kernels of size 11 × 11 and a stride of 4 pixels, where the stride is the distance between the Receptive Field centers of adjacent neurons in the same kernel map [23]. Then, the Max Pooling layer takes the output of the first convolutional layer as its input and pools each of the 96 feature maps over 3 × 3 windows. After unifying the input size, the second convolutional layer filters the output of the Max Pooling layer using 256 kernels of size 5 × 5. The third, fourth, and fifth convolutional layers are connected to each other directly, with no pooling or normalization layer in between. The third convolutional layer has 384 kernels of size 3 × 3 connected to the output of the second convolutional layer [24]. The fourth convolutional layer has 384 kernels of size 3 × 3, and the fifth convolutional layer has 256 kernels of size 3 × 3. Finally, 256 feature maps of size 6 × 6 are obtained through these five convolutional layers. These feature maps are fed to three fully connected layers with 4096, 1,000, and 10 neurons, respectively. The final detection result is output by the last fully connected layer [25].
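The layer sizes quoted above can be checked with a small shape calculation. The sketch below is illustrative only: the kernel counts, kernel sizes, and the stride of 4 come from the text, whereas the paddings, the pooling stride of 2, the placement of pooling after the first, second, and fifth convolutional layers, and the 224 × 224 input crop are assumptions chosen so that the output matches the stated 256 feature maps of size 6 × 6.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling window (floor rounding)."""
    return (size + 2 * pad - kernel) // stride + 1

# (name, output channels, kernel, stride, padding); None channels marks a pooling stage.
# Paddings and pooling strides are assumptions, not values given in the text.
STAGES = [
    ("conv1", 96, 11, 4, 2),
    ("pool1", None, 3, 2, 0),
    ("conv2", 256, 5, 1, 2),
    ("pool2", None, 3, 2, 0),
    ("conv3", 384, 3, 1, 1),
    ("conv4", 384, 3, 1, 1),
    ("conv5", 256, 3, 1, 1),
    ("pool5", None, 3, 2, 0),
]

def trace_shapes(input_size=224, stages=STAGES):
    """Return (stage name, channels, spatial size) after every stage."""
    shapes, size, channels = [], input_size, 3  # 3 input channels assumed (RGB image)
    for name, out_ch, kernel, stride, pad in stages:
        size = conv_out(size, kernel, stride, pad)
        if out_ch is not None:
            channels = out_ch  # pooling keeps the channel count unchanged
        shapes.append((name, channels, size))
    return shapes
```

Under these assumptions the last stage yields 256 maps of 6 × 6, i.e., 9216 values that feed the first fully connected layer.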

Network Training and Learning Methods.
The network is a deep layered CNN that extracts local features by convolving the input image with a set of kernel filters. The convolution layer uses linear convolution filters and nonlinear activation functions to obtain feature maps. The plane formed by the outputs of the neurons that share a filter is a feature map, which is processed by the Pooling layer before being passed to the next layer. Different kernel filters are applied over the Local Receptive Field to obtain various feature maps [26]. Equation (5) gives the convolution performed on the entire feature map followed by the nonlinear activation function.
In equation (5), $X_q^l$ denotes the feature map obtained by the $q$-th convolution kernel in the $l$-th layer; $\oplus$ signifies the convolution operation; $k_{pq}^l$ represents the convolution kernel; $M_q$ represents the set of input feature maps $X^{l-1}$ from the previous layer; max represents the nonlinear activation function ReLU; and $b_q^l$ refers to the bias. Since local response normalization benefits the generalization of the network, it is applied after the ReLU in some layers of this network [27]. This normalization produces an effect similar to lateral inhibition in real neurons: it creates competition among the neuron outputs computed by different convolution kernels and emphasizes the larger activations. Equation (6) describes the Pooling layer used here.
In equation (6), down denotes the subsampling function, which, using Max Pooling, takes the maximum value of the feature map $X_q^l$ within each $n \times n$ region [28]. In the CNN, convolution layers and Pooling layers appear alternately. Since the output layer is fully connected to the previous layer, the obtained feature vector can be input directly to the logistic regression layer to perform the detection task, and the weights of the network are learned with the backpropagation algorithm [29]. During learning, the gradient of the $l$-th convolutional layer is calculated through backpropagation according to equation (7).
In equation (7), $W^l$ represents the weights of the $l$-th layer's filters; $b^l$ denotes the bias vector; $y^l$ refers to the output; $f$ represents the activation function; and $f'$ signifies the derivative of the activation function $f$. Equation (8) gives the update rule for the weights $\omega$. In equation (8), $i$ represents the iteration index; $\alpha$ stands for the momentum factor; $\mu$ refers to the momentum variable; $\lambda$ signifies the weight decay; $\eta$ indicates the learning rate; and $\left\langle \partial L / \partial \omega \,|_{\omega_i} \right\rangle_{D'}$ represents the average, over the $i$-th batch $D'$, of the derivative of the loss function $L$ with respect to $\omega$, evaluated at $\omega_i$.
Stochastic Gradient Descent is used to train the network. Since a small weight decay can reduce the training error of the model, the weight decay is set to 0.0005 [30]. Dropout and Momentum are used to enhance learning; Dropout in particular prevents overfitting during training. To reasonably shorten the time the network needs to converge, this paper sets the Dropout value in the fully connected layers to 0.510, $\alpha$ to 0.9, and $\lambda$ to 0.0005.
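A single training step with momentum and weight decay can be sketched as follows. Because equation (8) is not reproduced in this text, the exact form is an assumption: the sketch uses the common update in which the weight decay is scaled by the learning rate, with the momentum factor 0.9 and weight decay 0.0005 quoted above as defaults. The function name and the list-based representation are hypothetical.

```python
def sgd_momentum_step(weights, velocity, grads, lr, momentum=0.9, weight_decay=0.0005):
    """One update (assumed form): v <- momentum*v - weight_decay*lr*w - lr*g, then w <- w + v.

    weights, velocity and grads are equal-length lists of floats;
    grads holds the batch-averaged derivative of the loss for each weight.
    """
    new_velocity = [
        momentum * v - weight_decay * lr * w - lr * g
        for w, v, g in zip(weights, velocity, grads)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

For example, starting from a weight of 1.0 with zero velocity, a gradient of 0.5, and a learning rate of 0.01, the weight moves to 1.0 - 0.000005 - 0.005 = 0.994995.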

There are three fully connected layers in the network reported here. The last fully connected layer, the eighth layer, is the output layer. It takes the output of the seventh layer as its input and contains $m$ neurons corresponding to $m$ types of music styles, with output probabilities $P = [P_1, P_2, \ldots, P_m]$. The Softmax regression presented in equation (9) is used.
In equation (9), $X^8$ denotes the input of the softmax function, and $j$ stands for the current category to be calculated, with $j = 1, \ldots, m$. The cross-entropy function is the loss function for network training and is defined in equation (10). In equation (10), $h_j$ represents the expected output of the $j$-th class, and its value is zero or one; a value of 1 corresponds to the true class. $P_j$ represents the actual output of the $j$-th class.
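The softmax output and cross-entropy loss described above can be sketched in plain Python. This is an illustration under the standard definitions, $P_j = e^{x_j} / \sum_k e^{x_k}$ and $L = -\sum_j h_j \log P_j$; the function names, the max-subtraction trick, and the small `eps` guard are additions for numerical safety rather than details from the text.

```python
import math

def softmax(scores):
    """Turn the m output-layer inputs into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot, eps=1e-12):
    """Loss for one sample; one_hot[j] is 1 for the true class and 0 otherwise."""
    return -sum(h * math.log(p + eps) for p, h in zip(probs, one_hot))
```

With two equally likely classes the loss equals log 2, and it approaches 0 as the probability assigned to the true class approaches 1.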

Results and Discussion
In this experiment, the CNN model is trained with the Caffe framework to complete the detection of music similarity. First, the spectrogram of each music track is generated, and the HPSS algorithm extracts the corresponding time and frequency features of each track. Second, all the data, namely the harmonic spectrum carrying the time characteristics, the impact spectrum carrying the frequency characteristics, and the original music signal spectrum, are fed to the CNN together.
Third, the network parameters are changed, and the final detection result is obtained through training and testing. The main performance index used here is the detection rate. A total of 500 audio recordings containing 3,000 music excerpts are used in training and testing. Then, the influence of the training-related hyperparameters on the detection rate is explored by modifying them. Table 1 lists the final values of the training-related hyperparameters after tuning.
It can be seen from Table 1 that the training-related hyperparameters significantly affect the convergence and learning speed of the network, which can be observed from the plot of the detection rate. All the data in the dataset are randomly split at a ratio of 5:1 into two subsets. Table 1 summarizes the parameter values at which, during tuning, the error rate on the training set becomes stable and falls within an acceptable range.
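A random 5:1 split of this kind can be sketched as follows. The function name and the fixed seed are hypothetical, and treating the larger subset as the training set is an assumption, since the text does not say which subset is which.

```python
import random

def split_5_to_1(items, seed=0):
    """Shuffle the items and split them 5:1 (larger part assumed to be for training)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # a fixed seed keeps the split reproducible
    cut = len(items) * 5 // 6
    return items[:cut], items[cut:]
```

Applied to 3,000 excerpts, this gives subsets of 2,500 and 500 items with no overlap.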
Due to the limited space of the article, only the impact of the learning rate η among the training-related hyperparameters is presented here. The results are shown in Figure 7.
It can be clearly seen from Figure 7 that after 20,000 iterations, when the learning rate η is 0.001, the learning process is slow, but the detection rate is stable. Therefore, it is necessary to increase the learning rate η appropriately to speed up the learning process while preserving stability. However, when the learning rate η reaches 0.1, the learning process becomes unstable, and the detection performance deteriorates.
It is finally found that the network training results are extremely sensitive to the training-related hyperparameters of the CNN, whether the learning rate, the momentum coefficient, the weight decay coefficient, or the dropout value. With the hyperparameter values set in Table 1, the detection rate on the dataset is 75.6% without expanding the experimental data. To study the influence of the number of convolutional layers on the recognition rate, networks with four, five, and six convolutional layers are compared. The recognition rate under different numbers of iterations is shown in Figure 8.
As can be seen in Figure 8, although the four-layer network converges faster, its recognition rate falls below that of the deeper networks as the number of iterations increases. A deeper network has better abstraction ability, but the recognition rate decreases when the network becomes too deep. Therefore, under normal circumstances, five convolutional layers are sufficient to obtain a good image representation.
The first way to expand the experimental data is to increase the number of training samples: image patches of size 224 × 224 are randomly extracted from each 256 × 256 image. Because each patch is only slightly smaller than the original image, the central part of the image is always included in the training set. The second method is to augment the training data through Principal Component Analysis (PCA). A PCA transformation is performed on the Red, Green, and Blue (RGB) channels for denoising and to ensure the richness of the RGB images. Then, a random scale factor is applied to each eigenvalue, and new scale factors are drawn in each round. This operation can significantly change the salient features of the same image and reduces the chance of overfitting. Before and after data expansion, the manually extracted time-series and frequency-series features are fed to the CNN for training in different combinations. Figure 9 shows the resulting effect.
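The random-patch expansion can be sketched as follows for an image stored as a list of rows. The function names are hypothetical, and the center crop is an assumed companion for evaluation that the text does not describe; the PCA-based augmentation is not covered by this sketch.

```python
import random

def random_crop(image, crop=224, rng=None):
    """Cut a crop x crop patch at a random position inside the image."""
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop)    # randint is inclusive at both ends
    left = rng.randint(0, width - crop)
    return [row[left:left + crop] for row in image[top:top + crop]]

def center_crop(image, crop=224):
    """Cut the central crop x crop patch (assumed choice for test-time use)."""
    top = (len(image) - crop) // 2
    left = (len(image[0]) - crop) // 2
    return [row[left:left + crop] for row in image[top:top + crop]]
```

For a 256 × 256 image, the patch offset ranges from 0 to 32 pixels in each direction, so the central 192 × 192 region appears in every patch.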
According to Figure 9, feeding the manually extracted time-series and frequency-series features to the CNN in different combinations gives different results before and after data expansion. The best detection rate is obtained when all three feature maps are used as input, which illustrates the necessity of comprehensive features. Figure 8 also suggests that the results improve significantly when the experimental data are fully expanded. Because the CNN has many parameters, sufficient training image data are needed to ensure effective training. Thus, data expansion is essential to make the network robust to more image samples and their variations.
Through continued research, it has been found that music repertoire is highly varied, whereas the amount of data used here is far from sufficient. Moreover, the current training data are not enough for the eight-layer network used here to achieve its best results. Not surprisingly, more training data can be expected to gradually improve the detection rate achieved so far. Figure 10 compares the detection rate of the algorithm reported here with that of existing detection methods.
According to Figure 10, the Gwardys method uses the HPSS algorithm to obtain the spectrogram, and its final detection rate is 72.2%, which is lower than that of the CNN method reported here. Lee's method trains only a two-layer Convolutional Deep Belief Network (CDBN). The CDBN detection model is shallower than the CNN, but its accuracy is not low, indicating that shallow networks can also produce good results on small datasets. Yang uses the K-Means Clustering algorithm for detection, which belongs to traditional machine learning, and the final detection rate is only 70.6%. It can be seen that the DL method reported here improves the detection rate to a certain extent.
Following the similarity detection described above, this paper classifies music styles on the GTZAN dataset and reports the results as a confusion matrix. GTZAN is the most commonly used public dataset in machine listening research for evaluating music genre recognition. The results are shown in Table 2.
As can be seen from Table 2, the percentages of correct classification lie on the diagonal of the matrix. Because the boundaries between some music styles are not clear, misjudgments occur easily. For example, some classical music is easily mistaken for blues, and disco music is easily mistaken for pop. As a result, the classification accuracy differs across types of music.
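Reading a confusion matrix in this way can be sketched as follows. The helper names are hypothetical, and rows are assumed to index the true class and columns the predicted class; the values of Table 2 are not reproduced here.

```python
def per_class_accuracy(confusion):
    """Fraction of each true class (row) that lies on the diagonal."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(confusion)]

def overall_accuracy(confusion):
    """Share of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0

def most_confused_pair(confusion, labels):
    """Return (count, true label, predicted label) for the largest off-diagonal entry."""
    best = None
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            if i != j and (best is None or count > best[0]):
                best = (count, labels[i], labels[j])
    return best
```

The largest off-diagonal entry identifies the pair of styles that the classifier confuses most often, which is how observations such as the disco and pop confusion can be read off the table.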

Conclusion
This paper proposes an MSD method based on CNN. The network framework used by the method was designed in detail, and key factors affecting its detection rate were studied. Using the CNN framework makes it possible to apply DL to small datasets. At first, the detection rate was only 67.1% when the original spectrogram alone was used for the experiment. The training-related hyperparameters were then adjusted, and data expansion was carried out to improve the results. After these operations, the final detection rate reached about 75.6%, a certain improvement over the results previously reported by several scholars. Finally, music similarity detection was applied to music style classification. Due to limitations of time, space, and personal ability, the detection rate has not achieved a breakthrough but has only improved relative to other methods, indicating that the advantages of CNNs have not yet been fully exploited. Future research will strive to make further progress.

Data Availability
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest
The authors have no conflicts of interest.